
storage: fsync sideload sst writes every 2MB #20449

Merged
merged 1 commit into from
Jan 8, 2018

Conversation

dt
Member

@dt dt commented Dec 4, 2017

#20352 configured rocksdb to sync every 512kb. This does the same for our sst sideload file writes.
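For context, a minimal sketch of the chunked write-and-sync approach, assuming the four-argument WriteFileSyncing signature visible in the r1 diff quoted below; the real implementation lives in pkg/util/fileutil/syncing_write.go and differs in details.

```go
package fileutil

import "os"

// WriteFileSyncing sketches the idea in this PR: write data to path in
// chunks, calling Sync after every syncEvery bytes so the kernel never
// accumulates a huge backlog of dirty pages for one large write.
func WriteFileSyncing(path string, data []byte, perm os.FileMode, syncEvery int) error {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	if syncEvery <= 0 {
		syncEvery = len(data) // no intermediate syncs
	}
	for len(data) > 0 && err == nil {
		chunk := data
		if len(chunk) > syncEvery {
			chunk = chunk[:syncEvery]
		}
		var n int
		n, err = f.Write(chunk)
		data = data[n:]
		if err == nil {
			err = f.Sync()
		}
	}
	// Close unconditionally so the fd is never leaked; only surface Close's
	// error if nothing failed earlier.
	if closeErr := f.Close(); err == nil {
		err = closeErr
	}
	return err
}
```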

@dt dt requested review from tbg, a-robinson and a team December 4, 2017 18:09
@cockroach-teamcity
Member

This change is Reviewable

@a-robinson
Contributor

🎉 Have you tested this out to make sure data gets flushed as intended? Would you like me to?


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 341 at r1 (raw file):

		}

		if err := fileutil.WriteFileSyncing(path, sst.Data, 0600, 512<<10); err != nil {

Nit, but is there any reason for all callers to have to provide 512<<10 instead of having a named constant somewhere or just not even keeping it as an argument to WriteFileSyncing?


pkg/util/fileutil/syncing_write.go, line 22 at r1 (raw file):

)

// WriteFileSyncing is esseially ioutil.WriteFile -- writes data to a file named

esseially


pkg/util/fileutil/syncing_write.go, line 52 at r1 (raw file):

	}

	if err == nil {

Shouldn't we close the file even if there was an error? Leaking the file descriptor doesn't seem like a good idea.


Comments from Reviewable

@tbg
Member

tbg commented Dec 4, 2017

:lgtm: mod @a-robinson's comments.


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


Comments from Reviewable

@dt
Member Author

dt commented Dec 4, 2017

Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 341 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Nit, but is there any reason for all callers to have to provide 512<<10 instead of having a named constant somewhere or just not even keeping it as an argument to WriteFileSyncing?

Done.


pkg/util/fileutil/syncing_write.go, line 22 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

esseially

Done.


pkg/util/fileutil/syncing_write.go, line 52 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Shouldn't we close the file even if there was an error? Leaking the file descriptor doesn't seem like a good idea.

Whoops, yes. I meant to close it unconditionally and overwrite err with close's err only when it was nil.
Done.
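For clarity, the unconditional-close idiom described here looks roughly like the following sketch, using a deferred close and a named return; the function name is illustrative, not the code that landed.

```go
package fileutil

import "os"

// writeAndClose illustrates the fix described above: the file is always
// closed, so the descriptor can't leak, and Close's error only overwrites
// err when err is still nil, so the first failure is the one reported.
func writeAndClose(path string, data []byte, perm os.FileMode) (err error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
	if err != nil {
		return err
	}
	defer func() {
		if closeErr := f.Close(); err == nil {
			err = closeErr
		}
	}()
	if _, err = f.Write(data); err != nil {
		return err
	}
	return f.Sync()
}
```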


Comments from Reviewable

@dt
Member Author

dt commented Dec 4, 2017

I haven't tested it yet -- was going to pick up your rocks config change too and try out a build with both of them, which, theoretically, should have no giant writes happening during a big restore.


Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

@a-robinson
Contributor

:lgtm: pending testing. The necessary rocksdb changes are in as of this morning. One thing to check to make sure this is working as intended is that during the restore you don't see the output of grep Dirty /proc/meminfo grow too much. In my testing with two rocksdb instances (one primary and one temp) it never grew past 10MB when things were working properly, and regularly got into the high 10s of MBs or hundreds of MBs when things weren't working properly. Unless we're writing a lot of these files in parallel, I'd expect it to stay below 10MB of dirty data.
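For anyone repeating that check, here is a small standalone Go poller, illustrative only, that does the equivalent of running grep Dirty /proc/meminfo once a second.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// Print the Dirty line from /proc/meminfo once a second (Linux only),
// roughly equivalent to running `grep Dirty /proc/meminfo` in a loop.
func main() {
	for range time.Tick(time.Second) {
		f, err := os.Open("/proc/meminfo")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			if strings.HasPrefix(sc.Text(), "Dirty:") {
				fmt.Println(sc.Text())
			}
		}
		f.Close()
	}
}
```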


Reviewed 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


Comments from Reviewable

@dt
Member Author

dt commented Dec 5, 2017

Good news and bad news.

2GB restore, 4 node roachprod gce cluster:

Good news: on master, polling grep Dirty /proc/meminfo every second, Dirty peaked at 1313672 kB. With this change, it peaked at 448 kB.

Bad news: that 2GB restore went from 36s to 139s. Yikes.

@a-robinson
Contributor

Ouch, that's pretty brutal. As discussed at lunch, syncing more than 512KB at a time may help. We may be able to parallelize a bit as well if that doesn't work out, but hopefully syncing a little more at a time gets us back to a more reasonable place.

It's nice that we won't have more than a gigabyte of dirty data waiting to be flushed anymore!

@dt
Member Author

dt commented Dec 5, 2017

Syncing every 4mb was more like 117s and every 8mb got it to 101s. Interestingly, even at 8mb, I still didn't observe a dirty number over ~300kb, though I was only polling once per second.

@dt
Member Author

dt commented Dec 5, 2017

one theory was that we could handle more concurrency now that we're not waiting for syncs, but, at first glance, upping the concurrent import request limit just makes it slower and brings back liveness errors :/

more digging is required.

@dt
Member Author

dt commented Dec 5, 2017

bumping importRequestLimit from 1 to 2 seems to cause a mild slow-down, and going to 3 worsens it and even causes sporadic slow heartbeat messages, so it doesn't look like turning that knob alone is a win.

maybe we want a background fsync and some syncs-in-flight limit, but eh -- that feels slightly like re-inventing what the kernel is supposed to be doing, and we do actually want ssts synced before we call the restore done, which they very much are not currently.
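Purely to illustrate the background-fsync-with-a-syncs-in-flight-limit idea mentioned above -- explicitly not something this PR does -- here is a sketch using a buffered channel as a semaphore:

```go
package fileutil

import (
	"os"
	"sync"
)

// syncLimiter sketches the "background fsync with a bounded number of syncs
// in flight" idea floated above; it is not what this PR implements.
type syncLimiter struct {
	sem chan struct{} // buffered; capacity = max syncs allowed in flight
	wg  sync.WaitGroup

	mu  sync.Mutex
	err error // first sync error observed, if any
}

func newSyncLimiter(maxInFlight int) *syncLimiter {
	return &syncLimiter{sem: make(chan struct{}, maxInFlight)}
}

// syncAsync starts f.Sync() in the background, blocking only when the
// in-flight limit has already been reached.
func (l *syncLimiter) syncAsync(f *os.File) {
	l.sem <- struct{}{} // acquire a slot
	l.wg.Add(1)
	go func() {
		defer func() { <-l.sem; l.wg.Done() }()
		if err := f.Sync(); err != nil {
			l.mu.Lock()
			if l.err == nil {
				l.err = err
			}
			l.mu.Unlock()
		}
	}()
}

// wait blocks until all outstanding syncs finish and returns the first error.
func (l *syncLimiter) wait() error {
	l.wg.Wait()
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.err
}
```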

@a-robinson
Contributor

Well that's frustrating. You and the bulk I/O team will have to decide what sort of performance hit is acceptable.

I was wondering, though, did you test with vs without fsync for restores larger than 2GB? It'd be useful to know how the difference scales -- is fsyncing always 2.5x slower, or does it just add 60 seconds to the end of a restore, or somewhere in between?

@petermattis
Collaborator

It might be a correctness bug that the sideloaded sstables were not previously being synced. An inopportune crash could remove data. I think we have to sync these files after they are written. See also https://stackoverflow.com/questions/15348431/does-close-call-fsync-on-linux.

@dt
Member Author

dt commented Dec 6, 2017

Yeah, agreed -- especially since once that restore finishes, we're happy to serve reads against those sstables, and on that 2gb restore, it looked like we still had 650mb+ reported as dirty well after RESTORE returned. So some of the "slow down" here is just that the previous restore numbers were a little misleading, since we hadn't actually written to disk after all.

Syncing every 8mb and syncing only after writing each file look about the same, but I still saw a few slow heartbeat warnings. 4mb and 2mb each look like they add a little more slowdown, presumably indicating that we're waiting for syncs when we could be getting more work done, but they get rid of the liveness complaints.

Switched it to a setting, so we can continue to play with it. RFAL.
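Roughly, making the chunk size tunable instead of a hard-coded constant looks like the sketch below; the variable name and plumbing are illustrative, and in the actual change the value comes from a cluster setting rather than a package-level var.

```go
package fileutil

// sstSyncChunkBytes stands in for the tunable described above: the number of
// bytes written between fsync calls. In the actual change the value comes
// from a cluster setting so it can be adjusted at runtime; the name and the
// plumbing here are illustrative only.
var sstSyncChunkBytes = int64(2 << 20) // 2MB, matching the final PR title

// writeSSTFile writes an sst payload using the currently configured chunk
// size, reusing the WriteFileSyncing sketch above.
func writeSSTFile(path string, data []byte) error {
	return WriteFileSyncing(path, data, 0600, int(sstSyncChunkBytes))
}
```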

@benesch
Contributor

benesch commented Dec 12, 2017

According to my read of the ext4 docs—and my understanding is that all of our local SSDs are mounted as ext4—the mount options default to commit=5s if unset, which means that all dirty data/metadata should be flushed to disk every 5s, even in the absence of fsync calls. I don't have a great intuition for how long syncing 600MB should take, but I'd think on the order of a few seconds. So I guess I don't understand why so many pages were dirty for so long in a non-syncing restore. When you say well after, @dt, do you mean a couple seconds or a couple minutes?

It is of course quite possible that I'm misreading/misunderstanding something.

@benesch
Contributor

benesch commented Dec 12, 2017

I'd also be curious to see if syncing once, right before the restore commits, would be sufficient.

In any case, these aren't complaints about the approach in this PR! Just musings.

@dt
Member Author

dt commented Dec 12, 2017

@benesch just for a couple seconds... but that is still after we'd published the table descs and have promised durability.

After switching to a cluster setting, I did a trial run with syncSize = 128mb, which should mean just one sync per sst, after writing the whole file. While that was a tiny bit faster than 4 or 2mb, it was still in the ~100s range, not the <40s range of no-sync, though I was seeing some heartbeat complaints on those runs.

@dt
Member Author

dt commented Dec 12, 2017

Oh, you meant one global sync, rather than individual files, sorry. While I guess that would get us consistency at the SQL level, since we probably don't care about those ranges until the RESTORE completes, I'm a little wary of the individual AddSSTable commands completing without having the data safely on disk? Also, the original motivation for the smaller syncs was partly to avoid the QoS issue, where the IO scheduler blocks all our small writes to chew through the big one.

@dt dt changed the title from "storage: fsync sideload sst writes every 512kb" to "storage: fsync sideload sst writes every 2MB" on Dec 19, 2017
@dt
Member Author

dt commented Dec 19, 2017

Any objections to merging this, as-is, and then investigating any improvements we want to make in followups? As-is, we're kinda lying by not syncing at all.

Contributor

@a-robinson a-robinson left a comment


No objections from me. A benchmark (or at least instructions to do the same test run you've been doing) would be nice, though.
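For reference, a sketch of the kind of microbenchmark that could cover this; the names, payload size, and chunk sizes are illustrative, and it assumes the four-argument WriteFileSyncing signature from the r1 diff above.

```go
package fileutil

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"
)

// BenchmarkWriteFileSyncing compares a few sync chunk sizes on a 16MB payload.
func BenchmarkWriteFileSyncing(b *testing.B) {
	data := bytes.Repeat([]byte("0123456789abcdef"), 1<<20) // 16MB payload
	for _, chunk := range []int{512 << 10, 2 << 20, 8 << 20} {
		b.Run(fmt.Sprintf("sync=%d", chunk), func(b *testing.B) {
			dir, err := ioutil.TempDir("", "syncing-write-bench")
			if err != nil {
				b.Fatal(err)
			}
			defer os.RemoveAll(dir)
			b.SetBytes(int64(len(data)))
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				path := filepath.Join(dir, fmt.Sprintf("sst-%d", i))
				if err := WriteFileSyncing(path, data, 0600, chunk); err != nil {
					b.Fatal(err)
				}
				_ = os.Remove(path) // keep disk usage bounded across iterations
			}
		})
	}
}
```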

@benesch
Contributor

benesch commented Dec 27, 2017

Oh, you meant one global sync, rather than individual files, sorry. While I guess that would get us consistency at the SQL level, since we probably don't care about those ranges until the RESTORE completes, I'm a little wary of the individual AddSSTable commands completing without having the data safely on disk? Also, the original motivation for the smaller syncs was partly to avoid the QoS issue, where the IO scheduler blocks all our small writes to chew through the big one.

Yeah, one sync at the very end is definitely not viable from a raft consistency perspective. It's just shocking to me that periodic fsyncs cause such a slowdown when the data in question can be flushed to disk in a few seconds.
